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Abstract. Multimodal structures in the sampling density (e.g. two competing phases) can be a 
serious problem for traditional Markov Chain Monte Carlo (MCMC), because correct sampling of 
the different structures can only be guaranteed for infinite sampling time. Samples may not decouple 
from the initial configuration for a long time and autocorrelation times may be hard to determine. 

We analyze a suitable modification [1] of the simulated tempering idea [2], which has orders of 
magnitude smaller autocorrelation times for multimodal sampling densities and which samples all 
peaks of multimodal structures according to their weight. The method generates exact, i. e. uncor- 
rected, samples and thus gives access to reliable error estimates. Exact tempering is applicable to 
arbitrary (continuous or discreet) sampling densities and moreover presents a possibility to calculate 
integrals over the density (e.g. the partition function for the Boltzmann distribution), which are not 
accessible by usual MCMC. 

INTRODUCTION 

Simulated Tempering was introduced in Ref. [2], parallel tempering, also known as 
Replica Exchange Monte Carlo, in Ref. [3, 4] and both have been widely used (see 
e. g. Refs. [5, 6]) to make Markov chain Monte Carlo faster. For an introduction to both 
methods see Ref. [7]. 

Propp and Wilson introduced the coupling from the past (CFTP) method to draw exact 
samples, i.e. samples which are guaranteed to be uncorrelated and to obey the desired 
distribution in Ref. [8]. Applications of this method to continuous degrees of freedom 
and cluster algorithms exist, see Refs. [9] and [10]. 

A small modification of the Simulated Tempering algorithm likewise allows to obtain 
uncorrelated samples, see Ref. [1]. The first two sections give an introduction to this pro- 
cedure and the resulting error estimates. We then examine the tempering Markov matrix 
and the autocorrelation time in and give indications about the needed parameters. We 
investigate the tempering algorithm for multidimensional continuous distributions and 
find a polynomial dependence on the dimension. Finally, we compare exact sampling 
with simulated tempering to the CFTP method for the two dimensional Ising model and 
find a quadratic dependence on the system size for simulated tempering, while CFTP 
needs exponential time for cold but finite temperatures. 



EXACT SAMPLING WITH SIMULATED TEMPERING 

Besides speeding simulations up, Simulated Tempering provides an alternative way to 
obtain exact samples from arbitrary probability density functions [1]. Fig. 1 shows the 



principle for a multi-modal distribution pi(X) consisting of two Gaussians, but it does 
not depend on the specified example and can thus be applied to a variety of probability 
distributions. We want to draw Exact samples from the distribution p\{X), which we 
can not sample directly, where X can be a discrete or continuous quantity of arbitrary 
dimension. In order to do so, we introduce an additional parameter /3 and the joint 
probability p(X,j3) = p(X|j8)p(/3). We have large freedom in choosing p(X,/3), for 
the simulation depicted in fig. 1, we chose: 

p(x,p m ) = ^ Pl (x)^po(xy-^, (i) 

where Z is the overall normalization, Z m is a constant depending on /3 m , which deter- 
mines p(j8 m ). The additional variable /3 was allowed to take M+l discrete values /3 m 
with /3o = and /3 M = 1. po(X) should be chosen in a way to allow generating Exact 
samples easily; in our example, it was a single broad Gaussian peak. Furthermore, its 
range in X-space should be broad enough to cover all structures of p\ (X). For applica- 
tion to physical systems, p\ would of course be chosen as oT E ^ (with E(X) denoting 
the energy) and po = const., which yields p(X\f3) = e~P E ( x ). In this case, f3 M is not 
chosen to be one, but can take any other value /3 max . 

We then do Markov chain Monte Carlo in the {X,/3}- space, where we alternate 
a couple of sweeps in X-space with moves in /3-direction. In /3-direction, /3 m / with 
m' = m ± 1 is proposed with equal probability and accepted according to the Metropolis 
scheme. In Z-space and for /3 m ^ 0, usual Metropolis updates are employed. A special 
case arises for X-moves at /3 m = 0. In this case p(X\fi = 0) p (X) and we are able to 
draw a new exact sample X' distributed according to po(X), which gives us a sample X' 
uncorrelated from X. 

An example of the resulting random walk is depicted on the 'floor' of Fig. 1. When- 
ever this random walk reaches /3 = 0, a new exact sample from po is drawn independent 
from the current state of the Markov chain so that the walk forgets its past. The MC time 
needed for one exact sample is thus given by the time needed by the Markov chain to 
travel from /3o = to /3m = 1 and back again. 

A plain MCMC run would instead be trapped in one of the two peaks and rarely tunnel 
to the other. Repeating several plain MCMC runs and taking their average would give 
the wrong expectation value x = 0, because the different weight of the peaks would not 
be accounted for. 



EXPECTATION VALUES AND ERROR ESTIMATES 

As the {X,/3}-samples obtained by the simulation obey p(X,j8), theX drawn at a given 
temperature /3 m obeys p(X|j3 m ). Expectation values for /3m = 1 are therefore calculated 



FIGURE 1. Example for a Simulated Tempering run. On the 'floor', the Markov chain travels through 
the {x,j3}-space, the larger dots are the obtained samples, the dotted lines show the way the Markov 
process has taken. Via j3o = 0, the walk reaches both peaks at Pm = 1> although no direct tunneling between 
them occurs. The peaks (solid lines) are the probabilities p(x\j5) for the various discrete /3-values. The 
samples drawn at a certain temperature obey this distribution. On the right hand 'wall', the vertical axis is 
the time axis of the simulation; one sees the wandering of the random walk through the temperatures. The 
thick lines are inserted where the walk reaches j3o = 0, i. e. where an independent exact sample is drawn 
from po = p{x\P = 0) (chosen as a single broad Gaussian peak). At these points, the walk forgets its past 
and a new uncorrelated bin starts. 



from all (correlated and uncorrelated) samples obtained at a given temperature: 

i Nm 1 ^M.ind Ni 

*(/3m = i) = — E*y = TT- E E^v 

N M j=l N M i=l v=1 

(X(Pm = 1)) = ^(Xj) = (X) 
j 



The Xj are the measurements obtained at the desired temperature /3m = 1, their index 
j was broken into i and v with i denoting the independent and uncorrelated bins and v 
labeling the correlated measurements within one bin, see Fig. 1. Nm,m ls me number 
of independent bins which contain at least one sample drawn at /3m, and Ni the number 

of measurements within the i-th bin. Nm = Y^i™* Ni is the total number of times the 
simulation has visited the desired temperature /3m = 1. A bar denotes the sample mean 
obtained in the Monte Carlo run, (...) denotes an expectation value over all samples. 
The sample mean is obviously unbiased. 

It is worth noting that measuring the bin averages does not give the same result, 
because the probability for a move in /3 -direction, and thus the number of measurements 



(Ni) taken in a bin before the walk returns to /3 = 0, is a random variable and depends 
on the current sample X: 



£ L v= i* l ,v ) ^ { J_f U=l X >,v ) = ( Uv^,v ) = (x) (3) 




Here, N^n = ^ L ( -=i ^ is the average number of measurements per bin. For the same 
reason, taking only the first sample of each bin does not give correct results. For a multi- 
modal pi (X) with a different height (and/or width) of the peaks as in Fig. 1, the Markov 
Chain may visit the smaller peak very often, but it will stay at the larger one longer. 

The independent samples provide a way to analyze correlations and to calculate 
reliable error estimates [1]. When calculating the variance of the estimate X, the new 
labels i and v become useful as it is now important to distinguish between correlated 
and uncorrelated samples: 

i Ni Nj 

iy i,j V=ljU=l 

= ^E< E E ax, v ax„) + ^T<E^E^-,v> +f(E^> = 

iV ij v=1m=1 iV y f,v iV f j 

=iV =A^2 

= ^E< E ^.W^EiE^lE^+^iE^+W 

iV i v,/i=l iV i^j v=l ^=1 iV i,v 

=0 " ^0 ' " ^0 ' 

(4) 

where (Lv,u AX ; - 5 vAXj ;jU ) = (£ v AX ; - 5 y) (Lit &Xj,n) f° r ' 7^ 7> because the measurements 
are from different bins, (£^ =1 AX ; V AX !)jU ) is independent of i, because all bins are 
equivalent. From Eq. 4, it follows for the variance 

var(X) = (X 2 ) - (X) 2 = (X 2 ) - (X) 2 = ^( £ AX ( - V AX^) . (5) 

^ V v,n=l 

The unknown expectation value (Lv'^=i AX, )V AX !)jU ) is estimated from the Monte Carlo 

run, thus (I^ M=1 AX ( - v AX i - M ) est « ^l£fl^ =1 AX,- yAX^. However, the variance 
depends on the determination of the above expectation value, so it can only be correct, 
if all modes of p\ have been sampled sufficiently. Similar formulae can be derived for 
the expectation values and error estimates of more complex observables (e.g. of the 
covariance), where correlations between the measured parameters can thus be taken into 
account. 



2 



300 



N = 2 (p=0 and p=1 only) 




o 
< 



200 





> 



+ 

V o ° 
so * 

X 

o/ X 



"O 



100 



C\J 



o 



40 



o 



X.-' 



40 



400 

distance between peaks 



1600 



FIGURE 2. Autocorrelation time for the Simulated Tempering algorithm in ID depending on the 
distance between the two peaks for two to five /3 -slices. The distance is measured in multiples of the 
width a of the Gaussian. 



Although nobody would think of using Monte Carlo simulation for one dimensional 
problems, as much more efficient approaches are available, it is interesting to examine 
the Markov matrix for a Simulated Tempering simulation in the two-dimensional X-fi- 
space with discretized X. The probability density p\ (x) for /3 = 1 was chosen to consist 
of two Gaussians well separated from each other and po(x) was chosen to be constant. 
For Simulated Tempering, the number of /3-slices was varied from two (just /3 = with 
p(X\[3 = 0) = po and /3 = 1 with p(X\fi = 1) = p\) to five. The intermediate /3 -values 
were chosen so as to give approximately the same transition rate between all pairs of 
adjacent /3 -values. Autocorrelation and thermalization are largely determined by the 
second largest eigenvalue (e^) of the Markov matrix, i. e. the one with magnitude closest 
to one. The autocorrelation time was approximately calculated as %ac ~ 1/(1 — \ e 2\)- 

Fig. 2 shows this autocorrelation time as a function of the distance of the two peaks. 
One sees that more /3-slices become necessary as the distance increases. For plain 
Markov chain Monte Carlo in the one-dimensional discrete X-space, the autocorrelation 
time far exceeded the range plotted in Fig. 2 even for a distance of d = 12 (t^c ~ 
2.6499e + 03) and its calculation is numerically instable for larger distances. 

The columns of the Tempering Markov matrix which correspond to /3 = are iden- 
tical, which means just that whenever the current state of the chain is at /3 = 0, the 
outcome of the next move will not depend on the current position in X-space. 



BEHAVIOR IN ONE DIMENSION 



PARALLEL TEMPERING 



Another method similar to Simulated Tempering Parallel Tempering, also called Ex- 
change Monte Carlo, see Refs. [3, 7]. In this method, we have M copies of X at the 
M values for /3. Instead of the space {X,/3 m } as in Simulated Tempering, we now con- 
sider the product space {Xq,X\, . . . ,X m , . . . ,Xm} where the configuration X m is at the 
temperature /3 m . At every /3 m , there is exactly one configuration X, denoted by X m and 
obeying the distribution p(X\j5 m ). The probability for the total product space is given 
by the product of the individual probabilities. We now do Markov chain Monte Carlo 
again with this product probability. In X-space, Metropolis Monte Carlo updates are 
performed for all /3 s independently. New configurations X m are obtained with the usual 
Metropolis random walk for /3 > 0, while a new sample is drawn directly from po(X) for 
/3 = 0. Alternated with the updates in X-space, Metropolis moves to swap configurations 
X m and X m+ \ at adjacent /3-values are performed. 

During the Monte Carlo run, all X — m will eventually get swapped to /3 = 0, where a 
new sample is drawn. This time, however, the random walk does not completely forget 
its past, which can be inferred from the Markov matrix for a similar toy situation as for 
Simulated Tempering above. Suppose, we have three /3-values /3o = 0, j8i, fa = 1, and 
the following temperature swaps occur in the Markov chain Monte Carlo: 

2 2 
1 -> -> 2 -> 1 -> 1 , 
112 2 

where a tilde means that an exact sample is drawn from p$. All Configurations have 
now been at /3 = 0, but the columns of the matrix corresponding to the above sequence 
of swaps are still not equal, which means that the current state of the Markov chain 
still depends on its initial state. However, these correlations are small after an initial 
thermalization and autocorrelation times are short. 



NEEDED PARAMETERS 

In order to do Simulated or Parallel Tempering, we have to adjust the values for the j8 m 
and the Z m , see Eq.(l). The /3 m -values have to be dense enough to give a considerable 
overlap of p(X\f5 m ) and p(X\fi m ±\). On the other hand, we want to have as few /3- 
values as possible between /3 = 1 and /3 = 0. The /3-values can be adjusted in a Parallel 
Tempering prerun, where a new value is inserted whenever the swapping rate between 
adjacent /3 s is too low. 

The ideal Z m needed for Simulated Tempering would make all /3-values equally likely. 
This prevents the Markov chain from spending too much time at on single temperature 
and thus speeds travel from /3 = to /3 = 1 and back again. This leads to: 

Z m oc j dX Pl (xf>"po(Xy-^ 
x 



For physical systems, the weight Z(j8) gives the partition function, which can only be 
determined in terms of Z(/3 = 0). For problems in data analysis, it is the model evidence, 
i.e. the probability for the chosen model integrated over all possible parameter values. 
The weights can be obtained from the visiting frequency for the /3 -values in Simulated 
Tempering preruns, but this is rather difficult, because they may differ by orders of 
magnitude. They are not needed for Parallel Tempering, where they cancel out, but the 
integral can still be calculated with a procedure similar to thermodynamical integration, 
see Ref. [6]. With the random samples produced at /3 m , we can estimate Z m+ \ for /3 m+ i: 

Z m+ i , Pi(X)^p (X) 1 -^ 

Z m ^ Pl (X)Pmp (Xy-P>n ^ 

where {■■■}p m denotes an expectation value calculated at /3 m . The integral Z(/3 = 1) is 
the product of all the measured ratios: 

M-l 7 

Z( J 8 = l)=Z M = Z -n% ±i - (7) 

Care must be taken in evaluating this quantity, because the configurations are inter- 
changed between j8 -values and the measurements obtained for the different j8 -values 
are therefore heavily correlated and the same applies to using parallel tempering data for 
multihistogrmming (Ref. [1 1]), as was proposed in Ref. [12]. 



BEHAVIOR IN HIGHER DIMENSIONS 

In this section, we examine the behavior of the Tempering algorithm in higher dimen- 
sions. We chose po as one single broad Gaussian with width Oq = 1 centered at X = 
and the wanted probability p\ consisted of two Gaussians of width o = 0.04 centered at 
X = (0.3, 0.3, . . . ) and X = (0.8, 0.8, . . . ), which were multiplied by 5000, so as to yield 
a norm n = 10000. 100 sweeps were performed between /3-moves, the j5 m and Z m were 
adjusted in a parallel tempering prerun. The geometric mean was used to insert new 
slices, except for finding the second lowest /3i > 0, where the old value was just divided 
by a constant, if the swapping rate was too low. As the number of needed /3 values Ng de- 
pends on the logarithm of the ratio of the volumes of p\ and po Nr <x — log ((<7i / Oq) d ) , 
the dependence on the dimension D is expected to be linear, which is indeed approxi- 
mately the case, as can be seen in Fig. 3. 

Figure 4 shows the number of MC updates needed for one independent sample. One 
sees that the increase in needed samples with the dimension of the problem approxi- 
mately obeys a power law in contrast to the behavior found for the CFTP algorithm 
(Ref. [9]), which has an exponential dependence on the dimension and generally sim- 
ilar performance as the rejection method, see Ref. [13]. For all presented dimensions, 
the results for the norm were consistent with the errorbars (see Fig. 5) and likewise the 
average for X, i. e. the simulation found both peaks. 
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FIGURE 3. Number Ng of needed j3 -values needed for various dimensions D. 
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FIGURE 4. Number of needed MC updates per independent sample for various dimensions D. 



APPLICATION TO THE ISING MODEL 

Although other choices could be more promising [14], we chose constant transition 
(Simulated Tempering) and swapping rates (Parallel Tempering, see Ref. [3]) between 
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FIGURE 5. Logarithm of the obtained norm log(Z(j3 = 1)) for various dimensions D. 



adjacent temperatures. This leads to: 



where E is the energy, C v the specific heat and kg the Boltzmann constant. This relation 
shows that we need denser /3-values where C v is large, i. e. near a (second order) phase 
transition. As the specific heat is small for low temperatures again, further cooling below 
the phase transition is easy. The specific heat is an extensive quantity, the number of 
needed temperatures is therefore expected to grow as Na ^jN sp i ns with N spins the 
number of spins. In this case, we used the arithmetic men to insert new /3-values. As 
expected, the number of needed temperatures scales proportional to the linear system 
size Np °z L= ^/N spins, see fig. 6. 
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FIGURE 6. Number Np of needed j3 -values needed for the two dimensional Ising model depending on 
the linear system size L = y/N sp j ns . 



x n + 




x CFTPT = T C 
_q_ CFTPT=1 <T C 
, CFTPT=1.25>T 
x Sim. Temp. T = t 
Q Sim. Temp. T=0.667<T, 
, Sim. Temp. T=1.25>T, 



FIGURE 7. Time per independent sample for the two dimensional Ising model depending on the linear 
system size L = ^N sp i n s- Solid lines: CFTP Method and single spin flip algorithm Dotted lines: Simulated 
Tempering and Swendsen Wang algorithm 

Figure 7 shows the time per independent sample for Exact Tempering and for the 
Coupling From The Past (CFTP) method with single spin flips introduced by Propp and 
Wilson in Ref. [8]. For CFTP, the time needed for an independent sample grows on a 
logarithmic scale with the system sizes for temperatures above the critical Tq, obeys a 



power law at Tq and grows exponentially for the ordered phase below Tq. There are 
CFTP schemes for the Swendsen Wang algorithm, but its runtime scales exponentially 
with the system size except at /3 = 0, T = oo and it is much slower than the single spin 
flip algorithm around Tq, see Ref. [13]. 

Exact Tempering with the Swendsen Wang algorithm, on the other hand, gives linear 
scaling for all temperatures. Although this is slower than CFTP for high temperatures, 
one does in fact get the high temperature results for free, because they are sampled 
anyway in a Tempering run for low temperatures. The critical exponent for Exact 
Tempering is two for both the Swendsen Wang (1.92 ± 0.06 at the lowest temperature) 
and the Wolff (1.98 ±0.07) algorithm. The reason for this is, that the time for an 
independent sample is determined by the number of steps needed to go from /3 = 
to /3 = fimax and back again and not by the algorithm used for the spin updates. This 
random walk in the temperatures scales proportional to the square of their number Np , 
which gives 

find ocNjocL 2 = N S pi ns . (9) 

This dependence % ind N spins breaks down for the single spin flip algorithm. Exact 
Tempering then becomes much slower, because the spin configurations at the critical 
temperature cannot be sampled as efficiently. This effect becomes more severe for 
first order transitions, where the algorithm does not manage the transition from the 
disordered to the ordered phase at Tq- 'Tempering' of a model parameter which carries 
the transition from first to second order might then be a solution. This was introduced 
in Ref. [15] for the Swendsen-Wang algorithm applied to the Potts model, where the 
variable 'tempering' parameter was the number of states q. Exact Tempering is also 
applicable to this variant, because the percolation problem for q = 1 can be sampled 
exactly. 



CONCLUSIONS 

This paper provides a discussion of Exact Sampling with Simulated Tempering [1] and 
compares it to Exact Sampling via the Propp-Wilson method [8]. The former is found to 
be advantageous in most cases. Simulated Tempering provides a way to draw exact, i.e. 
completely uncorrelated samples from arbitrary distributions in high dimensions. The 
peaks of multimodal densities are sampled with their respective weights. The parame- 
ters /3 m and Z m needed for the Simulated Tempering run can be adjusted in a Parallel 
Tempering prerun. While the Parallel Tempering algorithm itself does not provide per- 
fectly uncorrelated samples, its autocorrelation time is small. For practical purposes, it 
is a robust alternative, because it does not need the parameters Z m . Both methods allow 
to calculate the integral over the probability density, i.e. partition functions or model 
evidences. 
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